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Abstract. We define a Hidden Markov Model (HMM) in which each hidden state has time- 
dependent activity levels that drive transitions and emissions, and show how to estimate its 
parameters. Our construction is motivated by the problem of inferring human mobility on 
sub-daily time scales from, for example, mobile phone records. 


1. Introduction 

Hidden Markov models (HMMs) are stochastic models for systems with a set of unobserved 
states between which the system hops stochastically, sometimes emitting a signal from some 
alphabet, with probabilities that depend upon the current state. The situation in which 
we are specifically interested is human mobility, partially observed, i.e., occasional signals 
about a person’s location. For example, consider the cells of a mobile phone network, from 
which a user can make calls. In this case the states of a HMM are the cells, and the emitted 
signals are the cell itself, if a call is made by a particular user during each of a sequence of 
time intervals, or nothing (0), if that user does not make a call. In the latter case, the state 
(location) of the user is ‘hidden’, and must be inferred, while in the former case, assuming 
no errors in the data, the ‘hidden’ state is revealed by the call record. * 1 Since these are data 
from a mobile phone network, a user can move from cell to cell. 

Although many analyses of human mobility have estimated no more than rather crude 
statistics like the radius of gyration, the fraction of time spent at each location, or the 
entropy of the timeseries of locations [1-4], others have used HMMs to describe partially 
observed human mobility and have estimated their parameters [5-7]. With short time steps, 
however, a standard HMM (with time-independent parameters) is not a plausible model, 
since human mobility behavior changes according to, for example, the time of day [2,3,8,9]. 
We would like to create, therefore, a HMM with time-dependent parameters. Of course, 
allowing, for example, arbitrary transition/emission probabilities at each time step, would 
lead to an extremely underdetermined model. Rather, we need a model with only a few 
additional parameters to capture the time-dependence of human mobility. Since the total 
numbers of trips [2, 8, 9] and mobile phone calls [3,10] vary with time of day and day of 
week, we develop a time-dependent HMM in which the non-trivial transition and emission 
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Road balancing, in which calls may be routed through cell towers that are not the closest, makes this 
not strictly true. The general model we consider here allows for this possibility. 

1 



2 


ESTIMATING AN ADHMM 


probabilities are proportional to activity levels, i.e., to some given functions modeling how 
active humans are at different times and places. 

Since the transition and emission probabilities in our HMM are not constant in time, 
it is a non-stationary HMM. Many generalizations of HMMs have been considered previ¬ 
ously, of course, as more faithful models of various real systems. Some of these are non- 
stationary: Deng, for example, considers a class of models in which the emission probabil¬ 
ities are somewhat non-Markovian, depending on a number of previous emissions, and also 
have polynomial-in-time trend components which are to be estimated [11]. Duration HMMs 
(DHMMs), first suggested by Ferguson [12], allow a sort of non-stationarity in the state 
transition process by including a randomly chosen duration each time the state changes, i.e., 
a number of time steps without a transition away from that state. This kind of model has 
been generalized to make the transition probabilities functions of the number of steps the 
system has been in the current state [13]. In a different direction, since one can think of 
transitions between the hidden states with different emission probability distributions as a 
kind of non-stationarity, triplet Markov chains (TMCs) include an auxiliary set of underly¬ 
ing states, each of which corresponds to a different stationary regime for a HMM [14], Our 
approach is different than that of DHMMs and TMCs in that the time dependence of the 
transition and emission probabilities is not intrinsic and random, but rather exogenous and 
deterministic. Furthermore, unlike Deng’s models [11], we take the “trend” part of the time 
dependence to be given, not an additional (set of) parameter(s) to be estimated. 

Formally, our model consists of N possible hidden states, and we denote by ( X t ) G [1V] T 
the time series of T hidden states (for any n G N, [n] = {1,..., n}). State transitions happen 
according to a sequence of matrices giving the conditional probabilities of transitions, 

A(t) = ( aij(t )), where a^t) = Pr(X t+1 = i \ X t = j). 

At each state we observe an emission that takes a value from the set [M\ U {0}, where 0 
denotes “absence of an emission”. Let y = (y t ) G ([M] U {0}) T be the series of observed 
emissions. The probability of emission s at time t, from state j, is 


b sj (t) = Pr (Y t = s | X t = j). 


We define time varying activity levels for state j by a pair of functions with non-negative 
values, ( fjiQj) '■ [T] —> [0, l] 2 , the activity functions, which modulate transitions and emis¬ 
sions from the state, respectively. Given a state j, the transition probabilities from that 
state are functions of transition parameters (r^ > 0), i ^ j G [N], and the activity level: 


(G 


1 ~ if i=j, 


subject to the constraints that for all j G [. N] and all t G [T], 


( 1 ) 


hi*) r p' - L 

ie[N],iyj 


( 2 ) 


In practice we may have a priori knowledge that some transitions do not occur, so that (a^) 
(and (r,j)) have some entries set to 0. For example, with 10 minute time intervals and states 
representing cells in a mobile phone network, transitions between sufficiently distant cells 
are precluded. 
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Similarly, we assign emission parameters ( e S j > 0), s G [M], to each state j. The emission 
probabilities are 


b sj (t) = 




SJ 


if s 7 ^ 0 ; 


1 -9j(t)'E ae[M] e aj if s = 0, 


subject to the constraints that for all j G [N] and all t G [T], 

9j(t) ^ 1 e sj 5; 1- 


( 3 ) 

( 4 ) 


s£[M] 


Denote the initial distribution over states by 7 Tj = Pr(Ad = j), subject to the constraints 
77j > 0 and 

Y = L ( 5 ) 

je[N] 

Were this a typical hidden Markov model, we could estimate its parameters using the 
Baum-Welch algorithm [15,16]. Since it is not, we develop a novel Expectation Maximization 
(EM) algorithm [17] to estimate the parameters 0 = ((tt,), (t^), (e s _,-)), given y, fj(t ) and 
gj{t), as follows. 


2. Expectation Maximization 

The expectation maximization algorithm maximizes, at each iterative step, the (expected) 
log-likelihood function described below. Let X be the set of all possible time series of states 
and let Q k be the estimate of 0 at the k-th iteration of the algorithm. 

e‘= ((*?),(*£). (4»- (0 

The algorithm begins by initializing the parameter estimates in the first (k = 1) iteration. 
Then the k + 1 st iteration consists of two steps: 

(i) Compute the expectation value of the log-likelihood, using the current (fc th ) estimate 
for the parameters: 

£(8, €>*) = log[Pr(i, a; ©)] Pr(i | a; 0‘), (7) 

xex 

where Pr(-; 0) means Pr(-) in a probability distribution parametrized by 0. 

(ii) Find the parameters that maximize the expected log-likelihood: 

0 fc+1 = arg max £(@, 0 fc ), 
e 

subject to constraints in the inequalities ([2]), (SJ) and ©. 

As in the regular Baum-Welch algorithm, we express our computations in terms of certain 
conditional probabilities based on the parameters estimated at the k th iteration, 

7 k (t) = Pr(X t = j\y-e k ), 

( 8 ) 

tf j (t) = Pr(X t =j,X t+ 1 =z\y,O k ). 

Some reindexing of eq. (J7J) yields the following expression for £(0,0 fc ) in terms of these 
probabilities: 

T-1 T 

C(G,Q k ) = log( 7r i)7i(l)+ Y (*)■ (9) 

je[V] i,je[N] t =1 JG[V] t= 1 
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We iterate steps 
in eq. (jHJ) above. 


and 


n 


for which we compute 7 k {t) and £-(£) i n eqs. ©, and £( 0 , 0 fc ) 
We continue until some standard of convergence is achieved. We then 


output the final Q k as our estimate of 0 . 


Theorem 2.1. There is a constrained expectation maximization algorithm giving a sequence 
of estimates 0 fc that converges to a critical point of the likelihood function, which is the max¬ 
imum likelihood estimate 0 for the observed sequence y when the initial guess is sufficiently 
close. Further, to achieve a precision e in the estimates, the time complexity of the algo¬ 
rithm is 0[(N 2 + M)Tlog(T/e)). In particular, for a fixed precision e, the time complexity 
is 0((N 2 + M)T log T). 


Proof. Suppose we have the estimates 0 fc defined in eq. ((HD from step k of the algorithm. We 
proceed to compute ~fj +1 (t) and £^ + 1 (t) in eqs. (JH]) to begin the next iteration. Just as in the 
regular Baum-Welch algorithm, we apply dynamic programming. Denote the k th estimate 
of the transition matrix by A k (t ) = (ah(f)), where 


^r k if i = j. 

(10) 

Similarly, 

Ik (f] _ / 9j ( t ) e k sj if s ^ 0; 

sj[ \ 1 - 9jif) E sg[m] tsj ^ s = 0. 

(11) 

ft is convenient to define B k {t) to be the diagonal matrix with jj th entry b k t j{t). Now compute 
two sequences of (co)vectors, a k (t ) G ~R N and (3 k (t) G (M Ar )f, recursively, as follows: 

a k {l) = B k ( l)7T fc ; 

a k {t) = B k (t)A k {t - 1 )a k (t - 1); 

P k (T) = 1 J B k (T)- 
f3 k {t) = p k (t + l)A k {t)B k (t). 



Then 2 

3 P k (t)B k {t)-'a k (t) ’ 

k+1(t) = /ff(* + l)aij(*)Qj(*) 
13 (3 k (t + T)A k {t)a k {f) 


The estimates for the initial probabilities 717 are the same as in the normal Baum-Welch 
algorithm, as is clear from the expression for £(0, @ fc ) in eq. ([9]). Thus 


it k+1 = 7 ? + 1 (l). 


2 With a slight notational abuse, B k {t) 1 denotes the diagonal matrix whose jj th entry is l/b k t j(t) if 
b k At) A 0 and 1 otherwise. 
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Now notice that all the constraints in (J2]) and 
by the strongest constraints: for all j G [N], 


necessary to define £(0, @ fc ) are implied 


fj ^ Fj < 1, 


where f* = max te[T _i] f 3 (t ), and 


r+j 


9i Y ^ ^ 

se[M] 


( 12 ) 


(13) 


N 


where g* = max ie[T] g 3 (t). 

Consider the computation of f*j +1 , for i ^ j G [N], It should lie in the domain Fj C 
defined by constraints (1I2|) and the non-negativity of the parameters. Since these constraints 
are independent for different j s, we can consider each j separately, and find the optimal 
parameters t 13 by computing the critical points of £(@, @ fc ) relative to (ly,-) e Fj. Using 
eqs. ([HD and (HD, 

T_1 T_1 mm (14) 


<9£(0,0 fc ) 


drij 


1 \U hi'-Ajh 


r+i r n 


If the left sum in eq. ([HD , Ht=i £^(£) = 0, the derivative is nonpositive, so £(@,@ fc ) 
is weakly decreasing and = 0 gives its largest value. If the right sum in eq. ([HD, 
J2t=i Fj) = 0) the derivative is nonnegative, so £(©, 0 fc ) is weakly 

increasing and takes its maximum value when r l3 is as large as possible, i.e., when it satu¬ 
rates constraint We will show how to handle this situation after discussing the generic 
case which we do next. 

Assuming then that neither sum in eq. (TT4T) is 0, to find the stationary points of £(@, Q k ) 
we set eq. (fl4l) to 0 and solve for t 13 . Specifically, t 13 must satisfy 


T—1 




T—1 


mm 


'V +- 


f=l 


^ 1 

t =1 


fj{t) Fj 


(15) 


We note that if fj(t ) = 1, which makes the transition probabilities, a l3 , time indepen¬ 
dent, then the solution to eq. 03 is the familiar Baum-Welch solution: a k j = r k = 

Yt=i Yt=\ 7j(f) for a H h j G [N]. For non-constant activity functions, however, 

the solution is more complicated. 

Since the right side of eq. (1T5]) is manifestly independent of i, the left side must be, too. 
Let 

T - ■ T~ ■ ■ 

(16) 


T i = 


Fj _ Fj 




^k ’ 


where the last expression uses the antiderivative convention that for a function of t denoted 
by a letter in lower case, the corresponding upper case letter 3 represents its sum over its 
domain of definition (t — 1 to T — 1 in this case). Now 

Y AT = Tj Y E rj = T ^YY £j M ■ 

r+j 


r+j 


t= 1 r^j 


3" 


is upper case £ A is upper case A; M is upper case /i; N is upper case u; V is upper case 7. 
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If we denote the probability of moving away from state j by 

r^j 

we can write 'Yhr^j T rj = T j^j- Substituting in eq. (TT5i) . this gives: 


i v m&l 


or equivalently: 

~ h i h - W) Mf' 


(17) 


We must solve this equation for Tj, whence we can use eq. (1T6|) to solve for each of the r^. 
Since Tj is nonnegative and constraint (fl2|) must hold, we recast these conditions in terms 
of Tj as: 


f*M k < 


Mi 


E 


r+j Tr i 


= — < OO. 
Tj 


Let 



max fj it) 
te[r- i]|4(t)^o 



Each of the terms in the sum on the right side of eq. (fT71) is strictly decreasing in 1/tj when 
it is well-defined (1 /tj > f k *M k ). Values of 1/tj just larger than f k *M k make the sum 
arbitrarily large, and as 1/rj increases from that value, the sum decreases monotonically to 
0, so exactly one value of Tj < l/(f k *M k ) will satisfy eq. (ITT)) . We can solve the equation 
numerically to find this value, call it rj\ If rj' < l/(f*M k ), splitting it proportionally to 
according to eq. (H6l) gives the unique critical point (r tJ ) £ Fj of £(©, & k ). 

We can compute explicitly the Hessian of £(©, © k ) with respect to the r v -; its components 
are: 

a 2 £(e,») Ef. v 

3T,idT,,, t § r .y 


Thus, as a matrix the Hessian can be written as the sum of two matrices: 


( d 2 C(Q,Q k ) 

y dTijdTj'j 


A 


K 


'l j 


\ 



SF fjWZjjW ,iT 

tr(i-/,WE^r r ,) 2 ’ 


where 1 G is the vector of all Is. Each of the matrices on the right is negative semi- 
definite, so the Hessian is also. Thus the (unique) critical point we found in this case is a 
global maximum of £(©, © fc ) in Fj and hence the choice for f*) +1 . 

If the solution does not satisfy the original constraint (fT2]h he., r? > l/(f*M k ), or if the 
right side of eq. (fT^l) is 0, the maximum will be on the boundary of Fj. Thus we maximize 
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£((-), @ fc ) subject to the boundary constraint 


J2 Ti i= p' 

j&e[N] Jj 

This is in the form of the constraint in the regular Baum-Welch algorithm with 1 replaced 
by 1 //* and the self-transition probability set to 0. Thus the critical T t j can be computed 
as in the Baum-Welch algorithm, with T k , replaced by M k and the solution divided by /*: 


T. ■ = 

*7 


W 


f-M k 


This is the unique critical point in this case, and the global maximum of £(@, @ fc ) in Fj, by 
the same argument as in the Baum-Welch algorithm. This becomes the choice for r^ +1 . 

We turn to the computation of e^ +1 , s G [M]. As before, we begin by finding the stationary 
points of £(@,0 fc ), now relative to (e S j) G Gj C M M dehned by constraints (TT3j) and the 
non-negativity of these parameters. Using eqs. ([9]) and ([3]) gives: 


<9£(0,0 fc ) 
de s 


-Ed«w-E 




-so ^sj t=1 t=1 1 9 3 VO Ylle[M] e b 

As we did for eq. (HU), we must consider the situations when either of the sums in eq. (1T5D 
vanishes. When the left sum is 0, a extreme value is given by e S j = 0, and when the right 
sum is 0, constraint (fT3l) is saturated. Assuming neither of the sums vanishes, we find the 
stationary points by solving 


J o ,yt ■ 


(18) 




9j(t)li(t ) 


9j(t) J2ie[M] 6 ij 


0 ,yt ■ 


(19) 


■' i.i ,.i 

Solution to the emission equations, eqs. (fT9l) . follows using the same steps as for the transition 
equations, eqs. (HU. We first denote the probability of emission s G [M] from state j by: 4 5 

A k sj (t) = 7 ft)6 aty 

in terms of which we rewrite eq. (USD as 

^ T T 

mi 7; ]AA i t. i 

( 20 ) 


',yn 


e sj t=1 t=1 1 9j{t)z2ie[M] e ij 


We define (independent of s) 


e . = 

J A fc -' 


( 21 ) 


We also dehne the probability of any non-zero emission from state j, 

■#) = E 


lG[M] 


4 In our notation, the usual Baum-Welch estimate is ah = fh = Sh/r* for all i,j. 

5 In a simplified model for the mobile phone data we discussed in the Introduction, M = N and every 
state j emits either the signal j or 0; in other words, e S j = 0 if s ^ j £ [M], This simplifies the following 
expressions: For t such that y t = j , 7 = 1 (the state is j with certainty if the observed emission is j). 
i.e., Ah(t) = S j>yt , and A k sj (t) = 0 if s ^ j £ [M]. 
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which, used in eq. ([20]) . gives 


Let 


i = £ 


^ 1 / e j - 9j(t)N} 


( 22 ) 


g k * = max 

te[T]\\ k 0j (t)^o 


9j (t) <g*j 


As before, there is exactly one solution e“) < 1 /(g^Nj) to eq. (1221) . We can find it numerically, 
and if e C j < l/(g*Nj), we use eq. d2T|) to find all the e S j, which will then satisfy constraint (fT3j) . 
Again we can compute the ffessian explicitly to confirm that this is a global maximum of 
£(0, Q k ), now in Gj, and hence the choice for 


~k +1 

'sj 


If > l/(g*N k ), or if the right side of eq. (j20l) is 0, we must find instead the critical point 
on the boundary of Gf 

1 

z sj * 

g* 


= 


se[M] 

Again, this is in the form of the constraint in the regular Baum-Welch algorithm. Accord¬ 
ingly, we set the critical e S j to the Baum-Welch estimate, rescaled by g*: 


e sj 


A k . 

SJ 


g*N 


k ' 


This is the unique critical point and the global maximum of £(@,@ A: ) in Gj, and therefore 
the choice for e* +1 in this case. 


S] 


We have shown how to find Q k+1 maximizing £(@, @ A: ) in eq. (JH]). This algorithm converges 
as claimed because it is an instance of expectation maximization. To understand its time 
complexity we must consider the numerical solution of eqs. (jT7[) and (]22j) . To simplify 
notation we rewrite eq. (fT7[) in terms of u = I/t,-, and a function w(u), 


w(u ) = 


T—1 

V — 

^^ 7/ — 

t 


mem 


1, 


(23) 


as w(u) — 0. As an initial estimate for the root, u c , we can use u L > f k *M k such that 


fk*ck (+k*\ 

f k *M k 


= 1, 


U 


(24) 


where t k * G [T — 1] satisfies = f k * and £jj(tj*) ^ 0. u L < u c since we found it using 

only one of the nonnegative terms in the sum in eq. 


Now recall that fj{t)£jj(t) < f k * for t G [T — 1], and 

o < Mj = N dw = E £$w < E W) < T - !• 

te[T- 1] te[T-i] mj te[T-1] 

Thus each term in the sum in eq. (1231) is no more than 


/, 


k* 


U 




k* ‘ 
3 
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At u R = 2 (T — l)fj* this is 1 j(T — 1), so w{u R ) < 0, which implies 

u L < u c < u R = 2{T - (25) 

Using Newton’s method, once we have an initial estimate “sufficiently close” to the root of 
w{u) = 0, the time complexity to find it with error less than e is 0(Tlog(l/e)), where the T 
comes from the cost of evaluating w{u) and w'{u ) at each iteration; in practice this is how we 
would fold the root. Since the length of the interval in (l25]l is 0(T), however, the bisection 
method gets us to precision e with 0(log(T/e)) steps, with total cost 0(Tlog(T/e)); thus 
this is the total complexity. 

We need to solve eqs. (fT7l) and (1221) N and M times, respectively, at each iteration, which 
thus adds 0((N + M)T log(T/e)) to the 0(N 2 T ) complexity of the computations for 7 4 fc+1 
and Thus the time complexity for the whole algorithm is O (( N 2 + M)T log(T/e)). □ 


3. Numerical Simulations 


To demonstrate the effect of the activity functions we consider a simple model with N = 3 
states and the same number of possible emissions (M = 3). From any state j, we only 
allow an emission to be either its own label j or 0, i.e., e S j = 0 for j ^ s 6 [M], so a 
non-zero emission uniquely identifies the state that emits it. We choose random transition 
and emission parameters: 



( T ij ) 


(0.770347, 0.579213, 0.0821789); 

/ 0.298244 0.0621274 

0.134788 0.3710750 

\ 0.383490 0.182008 


) 


where the omitted values are the components for which i — j. 

We generate sequences of length T = 24 • 6 • 7 • 200 (we may think of this as 200 weeks, 
with an observation every 10 minutes). We consider activity functions with variations that 
may approximate observed data, i.e., periodic variations with a period of 24 ■ 6 (one day). 
Specifically, our numerical simulations use the following three functions: 


(i) constant function, 

(ii) raised cosine, 


(iii) shifted cosine 


i{t) = i; 


r n (t) 


n — cos (2irt /(24 • 6)) 
n + 1 


Cjit) 


1 

3 


2 — cos 


/ 2tt (t - 6j) V 
V 24 -6 ) ' 


We generate a random sequence of states, x , and resulting emissions, y, using the transition 
and emission parameters above, and a pair of activity functions (a list of these pairs is shown 
in Table EJ). 

Before computing the sequence of parameter estimates, we need to specify how we compute 
initial estimates to start the iteration. This can only depend on the observed emission 
sequence y , since in any real scenario x is unknown. As a first guess, for this simple model, we 
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interpolate the state sequence x as follows : 6 For every pair of successive non-zero emissions, 
there is a segment of zeros (no emission) separating them. We divide each such segment 
into two subsegments: Let j G [M\ be the emission immediately preceding the segment, and 
i G [M] be the emission immediately following the segment. The second subsegment starts 
at the first time step after the one where fj first attains its maximum value on the segment 
(he., a time at which there is the maximum probability of hopping from state j to state 
i). We assign state j to the time steps in the first subsegment and the state i to those in 
the second. If the emission sequence y starts with a segment of zeros, then that segment 
is assigned the value of the first non-zero emission; similarly a terminal sequence of zeros is 
given the value of the last non-zero emission. Denote the interpolated states by z = ( zt ), 
ie[T], 

From z, we compute the estimate (frj) for the initial distribution over the states ( 717 ) by 
their frequencies of occurrence. For the initial r l3 estimate, f f, we use the method described 
in the proof of Theorem 12.11 to solve eq. (HU), using £*)( t ) = 5 i : Zt + 1 5 j jZt , including i = j. For 
the initial 6jj estimate, e b, we also use the method described in the proof of Theorem 12.II to 
solve eq. (HU, using t j(t) = 5 j)Zt . 

To understand the performance of the algorithm in Theorem 12.11 we need a measure of 
the error between the estimates and the real parameter values. The relative entropy is one 
measure for a stationary HMM. In our case we need to account for the time variation of the 
transition and emission probabilities, ap(t) and b S j(t). Hence we define a modified version 
of a relative entropy error criterion, the averaged relative entropy. 


Definition 3.1. Let Q = ( Q t ) and V = (P t ), t G [ T ], be two finite sequences of discrete 
probability distributions on a finite set X. The Averaged Relative Entropy (ARE) of V with 
respect to Q is 

ARECP, <2) = 1 £RE(P„e t ), 

1 tem 

where the usual relative entropy (RE) is given by 


RE {P tl Q t ) = Y J Pt{i) log 

iGX 


Pt(i) 


Thus the error function that we compute for given (t 13 ) and estimate (fb) is 

T((t„), (f‘)) = are((o,,(()), (a* («))), 

where ay(t) and b!- 3 {t) are related through f 3 to t %3 and fb by eq. dU and eq. (1TU1) . respectively. 
Similarly, for (e SJ ) and estimate (e(?j), tl ie error function is 

£.(M, ( 4 -)) = ARE((6. j (i)), few)). 

where b S] (t) and feb(f) are related through g 3 to e S j and e k v by eq. (J3J) and eq. (HU, respectively. 
(Remember that we are considering the simple case in which the emission y t = s G { 0 , j } 
when the state x t = j.) 

The pairs of activity functions fj and g 3 that we simulate numerically are described in 
Table [T] where the column indices label these pairs. 


6 For more general models, finding initial parameter estimates will be more complicated, depending on 
the particulars of the model. 
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a 

b 

c 

d 

e 

f 

g 

h 

fi 

1 

1 

l 

7T 

c j 

r 2 

n 

c j 

9j 

C 3 

n 

l 

T\ 

C 3 

1 

l 

l 


Table 1. Functions used in numerical simulations 


For each set of pairs of activity functions in Tabled] we run the algorithm for 50 iterations. 
Figures dl and [2] plot the averaged relative entropy for the parameter estimates as a function 
of iteration step. The labels (a)-(h) to the right of each plot appear in the order of the final 
error values. We do not provide a plot showing convergence of (tt } - ) since the only noticeable 
trend is that if they converge to an exact state value, it is usually to the initial state of the 
interpolated sequence z. 

In each case the error for both the transitions and the emissions decreases to small values. 
Since the ARE depends on the activity functions as well as on the parameters and their 
estimates, we need to compute a baseline error value for each case. For the parameters (r^) 
and a specific choice of (/,■) it is: 


B r ((T v ),(fi)) = E ARE((oy(t)),(o^(i)) 


where a,j(t) and a' %3 (t) are related through fj to and r- 3 , respectively, by eq. (d]), and 
where E[-] denotes the expectation over uniformly random (rh). To estimate this expectation 
value, we compute the average ARE of the parameters with respect to 1000 independently 
chosen sets of random parameters (rather than their estimates from our algorithm), for each 
case (a)-(h). For the emission parameters we compute baselines the same way, using 1000 
uniformly random values (eh) to estimate 


B e ((e a J,(g j ))=E 


ARE 




where b S j(t) and b' S3 (t) are related through g 3 to e S j and eh, respectively, by eq. (|3]), and 
where E[-] denotes the expectation value over uniformly random (eh). The baseline averages 
thus obtained for function pairs in Table di are recorded in Table [21 where the row labels 
indicate the parameters being baselined. 



a 

b 

c 

d 

e 

f 

g 

h 


1.637 

1.677 

1.657 

0.615 

0.603 

0.811 

0.81 

0.610 


0.603 

0.578 

1.563 

0.592 

0.584 

1.466 

1.539 

1.534 


TABLE 2. Baseline errors for (r^) and (e S j) for activity function pairs from Table dl 

We plot these baseline errors as horizontal lines in Figures d! and [2] Most of these are too 
close to be distinguishable; indeed they are all 0(1), in contrast to the estimation errors plots 
which are almost all smaller by at least an order of magnitude, and in most cases by 3 or 4, 
indicating very good parameter estimates. Furthermore, the relative quality of the estimates 
can be understood: Case (c) is the standard HMM, for which our algorithm reduces to the 
Baum-Welch algorithm [15,16]. Cases (a) and (b) have greater errors, which is not surprising 
since they have non-constant emission activity functions, oscillating in value up to 1. This 
means that for each of these cases, non-zero emissions are lower probability events, so there 
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is less information in y. Possibly surprising is the fact that when the transition activity 
function is non-constant, cases (d)-(h), the errors are smaller than in the standard HMM 
case. But this happens because state changing transitions are reduced, so that each non-zero 
emission observed provides more information. And among these cases, those with varying 
emission activity levels have larger errors than those without. 


Estimation error for transition parameters 


Sr((ty).(r <j)) 
10 r 




. 

'■C--.. . 


Iterations (k) 


a 

b 

c 

d 

e 

f 

9 

h 


Figure 1. Transition parameters estimation error, £ r ((r y -), (t^)), for suc¬ 
cessive iterations k. Labels to the right are of activity function pairs from 
Table [T] and are displayed in the order of the final error values. Baseline 
errors, <Bt((Aj), (fj)), from Tableware shown as horizontal lines. 


Estimation error for emission parameters 


Se«%),(E „)) 


a.((%).fo)) 


. . . . 

. . . . 

*. •. . . 

. . 


Iterations (k) 


a 

b 

c 

d 

e 

f 

9 
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Figure 2. Emission parameters estimation error, £ e ((e SJ -), (e^-)), for succes¬ 
sive iterations k. Labels to the right are of activity function pairs from Ta¬ 
ble [U and are displayed in the order of the final error values. Baseline errors, 
B e ((e S j), ( gj )), from Tableware shown as horizontal lines. 
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